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60287-MA:  The  human  microbiome  as  a  multipurpose  biomarker 

Associate  Professor  Curtis  Huttenhower,  Department  of  Biostatistics,  Harvard  School  of  Public  Health 

Our  final  technical  report  for  this  project  includes  1)  a  method  for  uniquely  identifying  human 
individuals  by  way  of  their  personal  microbial  communities  using  metagenomic  codes,  2)  a  novel 
Bayesian  method  (CCREPE)  to  identify  robust  correlation  and  co-exclusion  patterns  in 
compositional  data,  3)  a  generative  Bayesian  model,  SparseDOSSA,  for  generating  microbial 
community  data  with  known  covariation  and  association  patterns,  and  4)  the  PICRUSt  method 
and  software  for  metagenome  inference  from  taxonomic  profiles  using  ancestral  state 
reconstruction.  Open-source  software  implementations  for  all  four  main  results  are  available  at 
http://huttenhower.sph.harvard.edu/idabihtv,  http://huttenhower.sph.harvard.edu/ccrepe, 

http://huttenhower.sph.harvard.edu/sparsedossa,  and  http://huttenhower.sph.harvard.edu/picrust, 
respectively,  with  additional  material  at  http://huttenhower.sph.harvard.edu/galaxy  and 
http://bitbucket.org/biobakery.  Our  final  publication  list  includes  14  manuscripts  (PMIDs 
25964341,  24629344,  23013615,  26157614,  25303518,  22699609,  22699610,  26418763, 
25587358,  22904687,  22698087,  24843156,  23975157,  and  22807668)  and  two  currently  in 
preparation  (CCREPE  and  SparseDOSSA). 

Problem  Statement 

The  human  microbiome  comprises  the  communities  of  microbes  carried  in  and  on  the  body  in 
health  and  disease,  including  trillions  of  bacteria,  viruses,  archaea,  and  fungi  per  individual. 
These  microbiota,  which  defend  us  against  pathogens  and  help  digest  our  food,  are  personalized 
among  individuals.  The  specific  microbes  present  at  any  one  habitat  within  an  individual 
become  relatively  stable  during  the  first  several  years  of  life,  but  change  in  as-yet- 
uncharacterized  ways  as  a  host  is  exposed  to  new  environments,  diets,  locations,  and  social 
contacts.  The  microbial  composition  of  a  given  individual  might  thus  be  linked  to  his  genetic 
background  or  early  life  history,  for  example,  while  the  metabolism  of  those  microbes  would 
reveal  more  about  his  recent  medical  history  or  diet.  The  goals  of  this  project  are  thus  1)  to 
assess  the  structure  of  any  microbial  habitat  and  its  potential  for  identifiability  (including  dietary 
history,  medical  history,  biometric,  demographics  or  environmental  exposures)  and  2)  to 
determine  the  relationships  and  interactions  between  the  microbial  communities  within  a  host. 

Results  Summary 

Identifying  personal  microbiomes  using  metagenomic  codes 

Earge-scale  investigations  of  the  human  microbiome  have  revealed  great  variability  in  the  body 
site-specific  taxonomic  composition  of  organisms  across  healthy  individuals.  However,  it  was 
not  previously  known  whether  this  variability  is  sufficiently  nonrandom  to  uniquely  identify 
individuals  within  a  population,  nor  whether  it  is  also  sufficiently  stable  to  continue  uniquely 
identifying  individuals  over  long  time  periods  (weeks,  months,  or  years).  We  answered  these 
questions  by  developing  a  hitting  set-based  coding  algorithm,  which  defined  body  site-specific 
metagenomic  codes:  sets  of  microbial  taxa  or  genes  prioritized  to  uniquely  and  stably  identify 
individuals.  Codes  capturing  strain  variation  in  clade-specific  marker  genes  were  able  to 
distinguish  among  hundreds  of  individuals  at  an  initial  sampling  time  point.  In  comparisons  with 
follow-up  samples  collected  30-300  days  later,  ~30%  of  individuals  could  still  be  uniquely 
pinpointed  using  metagenomic  codes  from  a  typical  body  site. 

Codes  based  on  the  gut  microbiome  were  exceptionally  stable  and  pinpointed  >80%  of 
individuals.  The  failure  of  a  code  to  match  its  owner  at  a  later  time  point  was  largely  explained 


by  the  loss  of  specific  microbial  strains  (at  current  limits  of  detection)  and  was  only  weakly 
associated  with  the  length  of  the  sampling  interval.  In  addition  to  highlighting  patterns  of 
temporal  variation  in  the  ecology  of  the  human  microbiome,  this  work  demonstrated  the 
feasibility  of  microbiome-based  identifiability  for  the  first  time,  a  result  with  important  ethical 
implications  for  microbiome  study  design. 

In  order  to  construct  metagenomic  codes  that  are  stable  over  time,  we  first  identified  properties 
of  individual  microbial  features  (OTUs,  genes,  and  genomic  regions)  that  lead  to  their  repeated 
detection  over  multiple  sampling  time  points  (Fig.  1).  Features  with  low  prevalence  in  the 
population  tended  to  disappear  with  the  passage  of  time,  making  them  poor  choices  for 
maximizing  code  stability;  abundant  features  were  more  likely  to  be  stable  over  time.  Trends  in 
prevalence,  abundance,  and  persistence  were  conserved  across  different  body  sites  and  across 
features  assayed  by  different  technologies,  and  they  were  also  consistent  regardless  of  whether 
features  were  defined  as  16S  rRNA  gene -based  OTUs,  metagenomic  marker  gene  sequences,  or 
one-kilobase  genomic  window  abundances  (genomic  regions). 


Figure  1:  An  overview  of  uniquely  identifiable  metagenomic  code  derivation  and  validation.  A)  An  example  of 
three  individuals  and  their  metagenomic  features  (represented  by  capital  letters)  are  shown.  For  each  individual,  a 
subset  of  features  is  highlighted  that  is  unique  among  the  three  individuals.  We  refer  to  these  sets  as  metagenomic 
codes.  B)  The  same  three  individuals  reevaluated  after  weeks  to  months.  Individual  I’s  microbiome  has  remained 
stable,  and  his  code  still  uniquely  identifies  him  among  the  population  (a  true  positive).  Individual  2  has  lost 
metagenomic  feature  C,  and  his  code  no  longer  identifies  him  (a  false  negative).  Individual  3  has  lost  feature  B  and 
gained  feature  C.  Individual  3  is  still  a  true  positive  with  respect  to  his  own  code,  but  also  matches  individual  2’s 
code  (a  false  positive).  C)  Illustration  of  the  four  metagenomic  feature  types  considered  in  our  work:  OTUs,  species, 
kilobase  windows  from  reference  genomes  (kbwindows),  and  species-specific  marker  genes  (markers). 


To  construct  codes  for  each  sample  and  feature  type,  we  divided  all  features  into  three  categories 
within  each  individual:  (i)  confidently  detected,  (ii)  confidently  absent,  and  (iii)  ambiguous. 
These  categories  were  defined  by  a  pair  of  cutoffs  that  varied  by  feature  type.  For  example,  an 
OTU  was  considered  confidently  detected  if  its  relative  abundance  met  or  exceeded  0.001 
(0.1%),  confidently  absent  if  its  relative  abundance  was  below  10-5  (0.001%),  and  ambiguous 
otherwise.  We  defined  the  features  comprising  the  metagenomic  code  of  an  individual  to  have 
two  critical  properties:  (i)  all  code  features  were  confidently  detected  in  that  individual  at  the 
first  sampling  time  point,  and  (ii)  at  least  one  code  feature  was  confidently  absent  in  each  other 
individual  in  the  population  (thus  making  their  ensemble  unique  to  the  encoded  individual). 


To  construct  a  code  for  an  individual  X,  we  first  ranked  the  confidently-detected  features  of  X  in 
order  of  increasing  “abundance  gap,”  which  we  defined  as  the  abundance  of  the  feature  in  X 
minus  its  next-highest  observed  abundance  in  the  population.  This  ranking  scheme  prioritized  the 
inclusion  of  abundant  features,  which  were  initially  determined  to  be  most  stable,  and  penalized 
the  inclusion  of  features  ambiguously  detected  in  other  individuals,  which  would  be  more  likely 


to  produce  false  positives  at  future  time  points.  We  then  added  features  from  this  ranked  list  to  a 
putative  eode,  at  eaeh  step  “flagging”  individuals  in  the  population  for  whom  the  next-highest 
ranked  feature  was  eonfidently  absent  (features  that  would  fail  to  flag  new  individuals  eontribute 
no  new  information  and  are  skipped).  The  algorithm  terminates  when  (i)  all  other  individuals 
have  been  flagged  (i.e.  the  putative  code  eontains  at  least  one  feature  that  is  eonfidently  absent  in 
eaeh  other  individual,  thus  making  it  a  unique  eode)  or  (ii)  we  run  out  of  ranked  features  before 
flagging  all  other  individuals,  in  whieh  ease  we  have  failed  to  eonstruet  a  eode  for  X.  Optionally, 
after  assembling  a  unique  eode  (ease  i),  we  ean  eontinue  adding  features  to  the  code  until  a 
desired  minimum  size  is  reaehed  (7  features  in  our  evaluations).  A  minimum  code  size  adds 
robustness  to  noise  and,  effectively,  error  eorreetion  to  avoid  false  positives. 

Our  eoding  algorithm  was  able  to  eonstruet  unique  eodes  for  almost  all  individuals  by  foeusing 
on  metagenomieally-unique  marker  genes.  These  eodes  were  stable  in  roughly  half  of  the 
population  between  the  first  and  seeond  sampling  time  points  (True  Positive  Rate,  TPR  =  50%), 
with  relatively  low  false  positive  rates.  The  isolated  stool  body  site  was  an  exception,  having  a 
TPR  eloser  to  80%.  False  negatives  were  due  largely  to  the  disappearanee  of  one  or  more  of  the 
mierobial  taxa  eontributing  metagenomie  features  to  the  eode.  Notably,  the  likelihood  of  a  false 
negative  did  not  appear  to  depend  sensitively  on  the  time  sampling  interval,  as  a  few  weeks  to 
nearly  a  year  were  roughly  equivalent.  Codes  based  on  gene-level  features  eonsistently 
outperformed  OTU-based  eodes.  Relative  to  gene-level  codes,  OTU-level  eodes  not  only 
depended  upon  more  taxa,  but  also  required  the  inclusion  of  less-abundant  taxa  (whieh  tended  to 
be  less  stable)  to  aehieve  uniqueness.  On  the  other  hand,  gene-level  eodes  were  able  to 
incorporate  multiple  distinguishing  features  from  the  most  abundant  taxa  of  an  individual,  and 
were  therefore  more  robust  to  temporal  variation.  Comparing  metagenomie  codes  to  a  validation 
eohort  of  previous ly-unseen  subjeets  suggested  that  eodes  would  tend  to  remain  unique  in 
populations  of  order-of-magnitude  hundreds  of  individuals. 

The  results  of  this  study  were  published  in  PNAS  (PMID  25964341)  in  eollaboration  with 
Brendan  Bohannon  (University  of  Oregon)  and  Katherine  Lemon  (Forsyth  Institute).  It  was 
presented  at  the  2015  Dana-Farber  Caneer  Institute  Biostatistics  and  Computational  Biology 
seminar  series,  the  2015  BioC  Bioeonduetor  annual  meeting,  the  2015  Canadian  Institute  for 
Health  Researeh  International  Speaker  seminar  series,  the  2015  Simons  Foundation  Symposium 
on  Genomies  in  Single  Cells  and  Mierobiomes,  the  2015  Harvard  CATALYST  Understanding 
Biomarker  Scienee  workshop,  the  2015  Channing  Division  of  Network  Medieine  Theodore  L. 
Badger  Lecture  series,  the  2014  University  of  Oregon  Institute  for  Theoretical  Sciences  seminar 
series,  the  2014  Statistieal  and  Applied  Mathematieal  Seienees  Institute  Bioinformatics  Opening 
Workshop,  the  2014  META  Center  for  Systems  Biology  symposium,  the  2014  Dalhousie 
University  Centre  for  Comparative  Genomies  and  Evolutionary  Bioinformaties  and  Mierobiome 
User  Group,  the  2014  Keystone  Symposium  on  Exploiting  and  Understanding  Chemieal 
Biotransformations  in  the  Human  Mierobiome,  and  the  2014  Intelligent  Systems  for  Moleeular 
Biology  (ISMB)  eonference.  The  project  was  led  by  researeh  assoeiate  Dr.  Eric  Eranzosa. 

Co-occurrence  and  Co-exclusion  Patterns  in  the  Human  Mierobiome 

CCREPE  (Compositionality  Corrected  by  REnormalization  and  PErmutation)  has  evolved  sinee 
its  ineeption  during  this  projeet  from  an  ad  hoe  approaeh  to  a  posteriori  investigation  of 
assoeiations  in  mierobial  eommunity  data  (initially  named  ReBoot)  to  a  generalizable  Bayesian 
method  for  any  eomposition  data.  Briefly,  typieal  eorrelation  measures  sueh  as  Pearson  or 
Spearman  eorrelation  produee  false  positives  (referred  to  as  spurious  eorrelations)  when  applied 


to  compositional  data,  i.e.  measurements  in  whieh  all  values  sum  to  a  fixed  eonstant. 
Proportions,  in  which  fractions  sum  to  1  or  100%,  are  a  common  example;  these  arise  frequently 
in  ecology,  since  for  example  microbial  taxa  are  only  measurable  as  counts  or  as  fractions  of 
total  community  composition.  CCREPE  corrects  the  nominal  statistieal  significance  of  putative 
eorrelations  in  sueh  data  in  order  to  report  only  the  significance  of  assoeiation  above  and  beyond 
that  expected  due  to  compositionality  alone. 


CCREPE's  Bayesian  model  (Fig,  2)  assumes  that  a  single  composition,  Cj  =  (Qi,  ....Ci^pY ,  is 
generated  by  the  normalization  of  a  basis,  Xj  =  (Xj  i, ... ,  X^YY  .  That  is, 

Xi 


We  also  assume  that  samples  are  i.i.d.,  sueh  thatXj  ~  FxY  therefore  Cj  ~  Fc(-).  Note  that 
Fc(-)  is  determined  from  FxY  by  the  transformation  from  Xj  to  Cj  via  normalization.  The 
eovarianee  and  eorrelation  struetures  of  the  basis  are  denoted  by  =  [Pxjjf]^ 

respectively,  and  the  covariance  and  correlation  of  the  composition  by 

=  [Pcjj'l  Thus,  in  order  to  determine  which  compositional  correlations  are  significant  in  the 
basis,  we  assess  the  null  hypothesis  that  Pxjjf  —  0  (or,  equivalently,  Uxjjf  =  0)  for  features  j  and 
j',  remembering  that  in  real  data  only  the  composition  and  not  the  basis  is  observed. 


Figure  2:  Plate  diagram  for  the  CCREPE  Bayesian  model  of 
significant  associations  in  compositional  data.  Unobserved 
correlations  2  in  an  underlying  basis  R  are  generated  from  mean 
M  and  standard  deviation  a  ,  yielding  a  distribution  of  true 
abundances  that  is  assumed  to  be  lognormal  (LN).  Only  the 
normalized  compositions  C  are  observed. 


Given  that  X  ~  Fx{-)  with  covariance  Z;^  =  and  expeetation  [ix,  a  Taylor  expansion 

around  yields  the  approximate  eovarianee  for  g(X)  — ,  denoted  by  Z^  =  [ctc /r]  • 

LjXj 


Defining  o)  =  this  gives: 


Xc 


(I  -  o)l^)Z(I  -  mI’^Y 


allowing  us  to  predict  the  behavior  of  the  compositional  correlation  from  the  basis  parameters 
that  generate  it.  In  partieular,  for  two  features  in  the  ease  where  oyjy,  =  0  and  j  ^  ]',  then: 
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Thus  the  eovarianee  (Jc.jji  will  be  large  and  positive  if  both  features  take  up  a  large  proportion  of 
the  composition  but  their  variability  is  small  relative  to  the  total  variability  in  the  basis. 
Conversely,  covariance  will  be  large  and  negative  if  both  features  take  up  a  large  portion  of  the 


composition  and  a  large  portion  of  the  total  variability,  as  this  arrangement  reduees  the 
composition  (approximately)  to  one  of  two  parts.  Finally,  compositional  covariance  will  also  be 
large  and  negative  if  one  feature  takes  up  a  small  portion  of  the  composition  with  large 
variability  but  the  other  takes  up  a  large  portion  with  small  variability,  because  the  feature  with 
large  variability  forees  the  other  to  move  in  the  opposite  direetion  after  normalization  (Fig,  3). 


Figure  3:  Deriving  eonditions  under  which  spurious  correlations  emerge.  Three  examples  that  capture  the 
causes  of  spurious  correlation  (A)  in  compositional  data  between  (B)  two  features  of  high  abundance  and  high 
proportional  variance,  (C)  two  features  of  high  abundance  and  low  proportional  variance,  and  (D)  one  feature  of 
high  abundance,  one  low,  and  high  proportional  variance.  Either  high  mean  /j.  or  standard  deviation  a  is  sufficient, 
the  latter  surprisingly  so  even  in  cases  where  feature  means  are  low. 

Over  the  eourse  of  developing  CCREPE,  we  also  derived  a  novel  eeological  similarity  measure 
to  use  with  it,  the  NC-  or  N-dimensional  Cheekerboard  seore.  The  NC-score  extends  the 
cheekerboard  score  from  binary  presence-absence  variables  to  ordinal  values  by  re-defining 
patterns  of  eo-variation  and  co-exelusion  as  follows: 

1 .  We  define  a  eo-variation  pattern  as  a  2x2  submatrix  of  the  form 

^  \  a  <  b,  c  <  d  }, 

or  its  eonverse 

^  \  a  >  b,  c  >  d  }, 

where  a,  b,  c,  d  E  [0,  n-l].  Biologically,  this  pattern  describes  two  microbes  that  co-vary  in 
eoneert  between  two  samples,  i.e.  are  positively  associated. 

2.  Similarly,  we  define  a  co-exelusion  pattern  as  a  2x2  submatrix  of  the  form 

^\a>b,  a>c,  d>c,  d>b}, 

or  its  converse 

^  \  a  <  b,  a  <  c,  d  <  c,  d  <  b  }, 

where  a,  b,  c,  d  E  [0,  n-l].  Biologically,  this  pattern  deseribes  two  mierobes  with  different 
relative  abundanees  between  two  samples,  i.e.  negatively  assoeiated.  In  the  special  binary  case  of 
n  =  2,  these  two  patterns  eollapse  to  the  standard  eheckerboard  unit.  Otherwise,  NC-score  is 
equivalent  to  Kendall’s  tau  caleulated  on  “binned”  abundances  rather  than  ranks,  whieh  we 
perform  by  default  using  thresholds  of  “zero”  (relative  abundanee  =  0),  “very  low”  (>  0,  <  lE-4), 
“low”  (>  lE-4,  <  .01),  “medium”  (>  .01,  <  .25),  and  “high”  (>  .25,  <  1). 


Finally,  we  have  applied  CCREPE  to  two  relevant  mierobial  datasets,  approximately  5,500  16S 
rRNA  gene  taxonomie  profiles  derived  from  the  Human  Mierobiome  Projeet  as  published  in 
2012,  and  to  approximately  2,000  metagenomie  speeies-level  taxonomie  profiles  derived  from 
newly  sequeneed  metagenomes  spanning  the  same  subjects  analyzed  using  MetaPhlAn2.  The 
former,  published  as  PMID  22807668,  included  a  global  network  of  3,005  significant  co¬ 
occurrence  and  co-exclusion  relationships  between  197  clades  occurring  throughout  the  human 
mierobiome.  This  network  revealed  strong  niche  specialization,  with  most  microbial  associations 
occurring  within  body  sites  and  a  number  of  accompanying  inter-body  site  relationships. 
Microbial  communities  within  the  oropharynx  grouped  into  three  distinct  habitats,  which 
themselves  showed  no  direct  influence  on  the  composition  of  the  gut  microbiota.  Conversely, 
niches  such  as  the  vagina  demonstrated  little  to  no  decomposition  into  region-specific 
interactions.  Diverse  mechanisms  underlay  individual  interactions,  with  some  such  as  the  co¬ 
exclusion  of  Porphyromonaceae  family  members  and  Streptococcus  in  the  subgingival  plaque 
supported  by  known  biochemical  dependencies.  These  differences  varied  among  broad 
phylogenetic  groups  as  well,  with  the  Bacilli  and  Fusobacteria,  for  example,  both  enriched  for 
exclusion  of  taxa  from  other  clades.  Comparing  phylogenetic  versus  functional  similarities 
among  bacteria,  dominant  commensal  taxa  (such  as  Prevotellaceae  and  Bacteroides  in  the  gut) 
often  competed,  while  potential  pathogens  (e.g.  Treponema  and  Prevotella  in  the  dental  plaque) 
are  more  likely  to  co-occur  in  complementary  niches. 

With  the  latest  Bayesian  model  for  CCREPE,  we  assessed  species-level  profiles  from  ~2,000 
metagenomes  (Fig,  4).  This  yielded  a  much  smaller  network  of  104  within-  and  between-site 
associations,  almost  all  within-site,  spanning  115  site-specific  clades.  Overall,  these  recapitulated 
the  basic  characteristics  of  earlier  16S-based  networks,  including  little  between-site  interaction 
and  few  "hub"  microbes  (scale-freeness).  We  also  observed  a  cluster  of  co-variation  in  stool, 
which  contained  microbes  known  to  be  involved  in  the  health  of  the  gut  mierobiome,  including 
Escherichia  coli  and  several  clades  IV  and  XI Va  Clostridia.  These  new  methods  will  allow  the 
derivation  of  significant  co-variation  networks  from  high-dimensional  compositional  data, 
particularly  the  detection  of  species  and,  eventually,  sub-species  level  ecological  interactions 
within  the  human  mierobiome  and  other  microbial  communities. 

Figure  4:  Species-level  microbial 
associations  significant  by  CCREPE  in 
~2,000  HMPl-II  metagenomes.  Since 
publication  of  approximately  700  body¬ 
wide  metagenomes  during  the  Human 
Mierobiome  Project,  an  additional  -1,300 
metagenomes  have  been  sequenced 
spanning  100  individuals  at  three  time 
points  each.  CCREPE  was  run  on  the 
residuals  of  a  random  effects  model 
accounting  for  repeated  measures  to 
identify  significant  associations  between 
taxa  within  and  across  body  sites. 

The  manuscript  describing  CCREPE  is  currently  in  preparation  by  lead  Ph.D.  student  Emma 
Schwager,  and  during  the  course  of  the  project  period  it  has  included  contributions  from  software 
developer  Dr.  George  Weingart  and  M.S.  student  Craig  Bielski.  CCREPE  has  been  presented  at 
the  2014  Intelligent  Systems  for  Molecular  Biology  (ISMB)  conference,  the  2014  Program  in 
Quantitative  Genomics  conference,  the  2014  Statistical  and  Applied  Mathematical  Sciences 


Institute  Program  on  Beyond  Bioinformatics:  Statistieal  and  Mathematical  Challenges,  the  2014 
University  of  Oregon  Institute  for  Theoretical  Sciences  seminar  series,  the  2014  META  Center 
for  Systems  Biology  symposium,  the  2014  University  of  Washington  Genome  Sciences  seminar 
series,  the  2014  Human  Microbiome  Project  consortium  meeting,  the  2013  Weizmann  Institute 
Systems  Biology  program  seminar,  the  2013  Channing  Department  of  Network  Medicine 
seminar  series,  the  2013  Harvard  Medical  School  Systems  Biology  program  seminar  series,  the 
2013  Tufts  University  Computer  Science  department  colloquium,  the  2013  Human  Microbiome 
Science:  Vision  for  the  Future  NIH  workshop,  the  2013  General  Meeting  of  the  American 
Society  of  Microbiology,  the  2013  Harvard  School  of  Public  Health  Bioinformatics  Core  forum, 
the  2013  Los  Alamos  National  Laboratory  seminar  series,  and  the  2012  Harvard  Medical  School 
Systems  Biology  program.  In  addition,  CCREPE  has  been  an  analysis  methodology  in  four 
published  studies  (PMIDs  22807668,  24629344,  24843156,  22699609)  and  the  American  Gut 
study  currently  in  review. 

A  Hierarchical  Probabilistic  Model  of  Microbial  Community  Structure:  Sparse  Data 
Observations  for  the  Simulation  of  Synthetic  Abundances  (sparseDOSSA) 

While  many  statistical  methods  have  been  developed  to  facilitate  analysis  of  metagenomic  data, 
to  date  there  have  been  few  efforts  to  benchmark  these  methods  in  an  accurate  and  systematic 
manner  -  a  critical  challenge  to  the  developers  of  these  methods  as  well  as  the  methods’  end- 
users.  To  address  this  and  to  provide  a  generalizable  model  of  microbial  community  profiles,  we 
developed  SparseDOSSA  (Sparse  Data  Observations  for  the  Simulation  of  Synthetic 
Abundances):  a  hierarchical  model  of  microbial  ecological  population  structure.  SparseDOSSA 
is  capable  of  simulating  realistic  metagenomic  data  with  known  correlation  structures  and  thus 
provides  a  gold  standard  to  enable  benchmarking  of  statistical  metagenomics  methods. 

SparseDOSSA’s  model  captures  the  marginal  distribution  of  each  microbial  feature  as  a 
truncated,  zero-inflated  log-normal  distribution,  with  parameters  derived  in  turn  from  a  parent 
log-normal  distribution  (Fig,  5).  The  model  can  be  effectively  fit  to  reference  microbial  datasets 
in  order  to  parameterize  their  microbes  and  communities,  or  to  simulate  synthetic  datasets  of 
similar  population  structure  (including  the  optional  addition  of  known  correlation  structures).  We 
demonstrated  SparseDOSSA’s  utility  in  three  applications:  1)  accurately  modeling  microbial 
community  diversity  profiles  using  a  small  number  of  fit  parameters  trained  on  baseline  (healthy) 
human  microbial  communities  and  communities  from  inflammatory  bowel  disease  (IBD) 
patients;  2)  generating  synthetic  communities  with  simulated  environmental  and  ecological 
correlation  structures;  and  3)  recapitulating  the  results  of  an  earlier  clustering  analysis  describing 
microbial  response  to  diet  type  in  mice.  These  applications  comprise  the  most  common 
downstream  analyses  applied  in  metagenomic  sequencing  experiments,  and  thus  demonstrate 
SparseDOSSA’s  utility  as  a  general-purpose  aid  for  evaluating  statistical  methods  in  microbial 
community  analysis. 
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Figure  5:  SparseDOSSA  provides  a  generative  hierarehieal  Bayesian  model 
for  mierobial  eommunity  taxonomie  profiles.  This  comprises  a  two-layer 
structure  in  which  each  layer  is  controlled  by  a  lognormal  distribution. 
Individual  microbial  features  (second  layer)  are  assumed  to  be  drawn  from  a 
lognormal  distribution  with  marginal  mean  and  standard  deviation  cr,.  In 
combination  with  feature-specific  sparsity  (i.e.  expected  fraction  of  zeros)  pi,  Yij 
is  generated  as  the  number  of  reads  of  feature  i  in  sample  j.  To  generate  the 
overall  distribution  of  the  population  of  microbial  features,  are  parameters 
connecting  the  marginal  mean  parameters  to  the  marginal  standard  deviation 
parameters  and  marginal  sparsities,  respectively,  and  po,  So  are  the  mean  and 
standard  deviation  of  the  overall  lognormal  distribution  (first  layer).  Rectangles 
denote  replication  of  the  model  within  the  rectangle,  i.e.  plate  structure,  with  the 
number  of  replication  is  labeled  at  the  bottom-right  comer. 


We  validated  SparseDOSSA’s  ability  to  accurately  fit  and  simulate  microbial  communities  with 
ecological  patterns  recapitulating  those  in  a  variety  of  real  community  types,  by  comparing  the 
distribution  of  microbe-specific  means  and  beta-diversity  profile  in  simulated  data  to  the  real 
data  that  are  used  to  train  the  model.  This  ultimately  allows  us  and  others  to  quantitatively 
benchmark  analysis  methods  for  microbiome  features,  their  ecological  properties,  and  their 
associations  with  environment  and  host  phenotypes.  We  first  assessed  the  degree  to  which 
SparseDOSSA’s  fitted  two-layer  model  captured  the  marginal  variation  of  microbial  community 
taxonomic  profiling  data  across  all  microbes.  To  that  end,  we  first  compared  the  trend  of  rank 
average  abundance  of  all  microbes  in  two  real  datasets  (Morgan  Genome  Bio  2012,  250  samples 
and  158  genera,  and  HMP  Nature  2012,  16S  posterior  fornix  subset)  to  those  of  simulated 
datasets  generated  by  the  model  fitted  on  these  real  datasets.  The  fitting  was  repeated  using  both 
the  naive  and  fully  Bayesian  fitting  procedures,  with  simulated  datasets  generated  based  on 
SparseDOSSA’s  two-layer  model  setting  parameters  to  be  the  estimates  from  the  fitting  step  and 
with  the  same  dimension  as  the  real  dataset.  This  verified  the  model’s  ability  to  capture  the 
variation  pattern  of  marginal  microbial  community  profiles  and  the  validity  of  our  naive  fitting 
method  as  a  fast  approximation  of  the  full  Bayesian  fitting  method  (Fig,  6). 


(a) 


SparseDOSSA  calibrated  on  gut  microbiome 
samples  from  inflammatory  bowel  disease  patients 


(b)  SparseDOSSA  caiibrated  on  vaginai  microbiome 
samples  from  healthy  HMP  participants 
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Figure  6:  The  SparseDOSSA  model  aceurately  eaptures  feature  mean  distributions  and  beta  diversities  of 
microbial  communities.  A)  Rank  abundances  of  log-transformed  feature  specific  means  for  data  from  Morgan  et  al 


and  from  SparseDOSSA  simulated  data  based  on  this,  using  naive  estimates  and  the  pointwise  95%  posterior 
interval  derived  from  fully  Bayesian  fitting  of  the  same  model.  B)  MDS  analysis  on  Bray-Curtis  distances  of 
taxonomic  profiles  from  1 90  vaginal  microbial  communities  in  the  Human  Microbiome  Project  split  into  a  training 
subset  (95  samples)  for  model  fitting,  a  validation  set  (95  samples),  and  95  simulated  samples  from  the  resulting 
SparseDOSSA  model  (all  containing  172  microbial  features). 

When  used  as  a  model  for  simulating  realistie  mierobial  eommunity  data,  one  major  function  of 
SparseDOSSA  is  to  impose  a  known  "phenotypic  response"  in  synthetic  microbial  features.  This 
is  done  by  capturing  the  overall  diversity  pattern  of  a  community  and  subsequently  also  inducing 
artificial  correlations  between  features  and  sample  properties  such  as  environment  or  phenotype. 
We  verified  the  model's  ability  to  include  detectable  categorical  feature-metadata  associations  in 
its  simulated  output  by  artificially  associating  nine  randomly  chosen  features  and  a  binary 
environmental  metadatum  with  250  samples  and  the  Morgan  et  al  dataset  as  template.  All 
synthetic  associations  were  detected  among  the  17  features  of  greatest  effect  size,  with  no  false 
negatives.  Interestingly,  several  apparent  false  positives  (i.e.  features  that  are  spuriously 
differentially  abundant)  occur  at  this  level  of  significance,  caused  not  by  the  SparseDOSSA 
model  but  by  the  known  effects  of  compositionality  (i.e.  relative  abundance  normalization)  in 
ecological  data.  Simulated  associations  with  a  synthetic  continuous  metadatum  were  similarly 
successful,  including  the  model's  ability  to  correctly  capture  sparsity  patterns  in  microbial 
features  (Fig,  7). 

(a)  Validation  of  binary  metadata  (b)  Validation  of  continuous  metadata 

spike-in  and  detection  spike-in  and  detection 


Figure  7:  Simulating  categorical  or  continuously  valued  population  variability  among  microbial  community 
samples.  A)  Differences  of  mean  relative  abundances  between  two  classes  of  a  simulated  binary  metadatum  based 
on  Morgan  et  al,  along  with  the  empirical  inter-quatile  range  of  all  features  as  contrasted  between  metadatum  levels. 
B)  Correlation  of  one  feature  into  which  an  association  to  a  continuously  variable  metadatum  has  been  spiked  (Y 
axis)  with  that  metadatum's  value  (X  axis). 

The  SparseDOSSA  manuscript  is  currently  under  review  as  written  by  project  lead  Ph.D.  student 
Boyu  Ren  and  research  scientist  Dr.  Eric  Franzosa,  with  contributions  over  the  course  of  the 
project  from  undergraduate  research  assistants  Joseph  Moon  and  Yiren  Lu.  SparseDOSSA  has 
been  presented  at  the  2015  Dana-Farber  Cancer  Institute  Biostatistics  and  Computational 
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Inferring  microbial  community  metagenomes  from  marker  gene  sequences 


Profiling  phylogenetic  marker  genes,  such  as  the  16S  rRNA  gene,  is  a  key  tool  for  studies  of 
microbial  communities  but  does  not  provide  direct  evidence  of  a  community’s  functional 
capabilities.  We  developed  PICRUSt  (Phylogenetic  Investigation  of  Communities  by 
Reconstruction  of  Unobserved  States),  a  computational  approach  to  predict  the  functional 
composition  of  a  metagenome  using  marker  gene  data  and  a  database  of  reference  genomes. 
PICRUSt  uses  an  extended  ancestral-state  reconstruction  algorithm  to  predict  which  gene 
families  are  present  and  then  combines  gene  families  to  estimate  the  composite  metagenome. 
Using  16S  information  alone,  PICRUSt  recaptures  key  findings  from  the  Human  Microbiome 
Project  and  accurately  predicts  the  abundance  of  gene  families  in  host-associated  and 
environmental  communities,  with  quantifiable  uncertainty.  Phytogeny  and  function  are  thus 
sufficiently  linked  that  this  ‘predictive  metagenomic’  approach  has  begun  to  provide  useful 
insights  into  the  thousands  of  uncultivated  microbial  communities  for  which  only  marker  gene 
surveys  are  currently  available  (Fig,  8). 


Figure  8:  The  PICRUSt  workflow.  PICRUSt 
is  composed  of  two  high-level  workflows: 
gene  content  inference  (top)  and  metagenome 
inference  (bottom).  Beginning  with  a  reference 
phytogeny  and  a  gene  content  table,  gene 
content  inference  predicts  genes  for  each  taxon 
with  unknown  content,  including  marker  gene 
copy.  Metagenome  inference  takes  a 
taxonomic  profile,  where  taxa  correspond  to 
tips  in  the  reference  tree,  as  well  as  the  copy 
number  of  the  marker  gene  in  each  taxon  and 
its  gene  content  and  outputs  a  metagenome 
table  on  a  per-sample  basis. 


We  developed  PICRUSt  to  predict  the  functional  composition  of  a  microbial  community’s 
metagenome  from  its  16S  profile.  This  is  a  two-step  process.  In  the  initial  'gene  content 
inference'  step,  gene  content  is  precomputed  for  each  organism  in  a  reference  phylogenetic  tree. 
This  reconstructs  a  table  of  predicted  gene  family  abundances  for  each  organism  (tip)  in  the  16S- 
based  phylogeny.  Because  this  step  is  independent  of  any  particular  microbial  community 
sample,  it  is  pre-calculated  only  once.  The  subsequent  'metagenome  inference'  step  combines  the 
resulting  gene  content  predictions  for  all  microbial  taxa  with  the  relative  abundance  of  16S 
rRNA  genes  in  one  or  more  microbial  community  samples,  corrected  for  expected  16S  rRNA 
gene  copy  number,  to  generate  the  expected  abundances  of  gene  families  in  the  entire 
community. 

The  value  of  PICRUSt  depends  on  the  accuracy  of  its  predicted  metagenomes  from  marker  gene 
samples  and  the  corresponding  ability  to  recapitulate  findings  from  metagenomic  studies.  The 
performance  of  PICRUSt  was  first  evaluated  using  the  set  of  530  HMP  samples  that  were 
analyzed  using  both  16S  rRNA  gene  and  shotgun  metagenome  sequencing.  We  treated  HMP 
metagenomic  samples  as  a  reference  and  calculating  the  correlation  of  PICRUSt  predictions  from 
paired  I6S  samples  across  6,885  resulting  orthologous  groups.  Predictions  had  high  agreement 
with  metagenome  sample  abundances  across  all  body  sites  (Spearman  r=0.82,  p<0.001).  As  a 
targeted  example,  we  also  tested  PICRUSt’s  accuracy  in  specifically  predicting  the  abundance  of 


glycosaminoglycan  (GAG)  degradation  functions,  which  are  more  abundant  in  the  gut  than 
elsewhere  in  the  body.  Using  the  same  differential  enrichment  analysis  on  both  PICRUSt  and 
metagenomic  data  yielded  identical  rankings  across  body  sites  and  very  similar  quantitative 
results,  suggesting  that  PICRUSt  predictions  can  be  used  to  infer  biologically  meaningful 
differences  in  functional  abundance  from  16S  surveys  even  in  the  absence  of  comprehensive 
metagenomic  sequencing. 

Next,  we  then  evaluated  the  prediction  accuracy  of  PICRUSt  in  metagenomic  samples  from  a 
broader  range  of  habitats  including  mammalian  guts,  soils  from  diverse  geographic  locations, 
and  a  phylogenetic  ally  complex  hypersaline  mat  community  (Fig.  9).  These  habitats  represent 
more  challenging  validations  than  the  human  microbiome,  as  they  have  not  generally  been 
targeted  for  intensive  reference  genome  sequencing.  Because  PICRUSt  benefits  from  reference 
genomes  that  are  phylogenetically  similar  to  those  represented  in  a  community,  this  evaluation 
allowed  us  to  quantify  the  impact  of  increasing  dissimilarity  between  reference  genomes  and  the 
metagenome. 


Average  16S  distance  to  nearest  reference  genome  (NSTI) 


Figure  9:  PICRUSt  accuracy  across  various 
environmental  microbiomcs.  Prediction  accuracy  for  paired 
16S  rRNA  marker  gene  surveys  and  shotgun  metagenomes 
(y-axis)  are  plotted  against  the  availability  of  reference 
genomes  as  summarized  by  the  Nearest  Sequenced  Taxon 
Index  (NSTI;  x-axis).  Accuracy  is  summarized  using  the 
Spearman  correlation  between  the  relative  abundance  of  gene 
copy  number  predicted  from  16S  data  using  PICRUSt  versus 
the  relative  abundance  observed  in  the  sequenced  shotgun 
metagenome.  In  the  absence  of  large  differences  in 
metagenomic  sequencing  depth  (see  text),  relatively  well- 
characterized  environments,  such  as  the  human  gut,  have  low 
NSTI  values  and  can  be  predicted  accurately  from  16S 
surveys.  Conversely,  environments  containing  much 
unexplored  diversity  (e.g.  phyla  with  few  or  no  sequenced 
genomes),  such  as  the  Guerrero  Negro  hypersaline  microbial 
mats,  tended  to  have  high  NSTI  values. 


To  characterize  this  effect,  we  developed  the  Nearest  Sequenced  Taxon  Index  (NSTI)  to  quantify 
the  availability  of  nearby  genome  representatives  for  each  microbiome  sample.  NSTI  is  the  sum 
of  phylogenetic  distances  for  each  organism  in  the  taxonomic  profile  to  its  nearest  sequenced 
reference  genome,  measured  in  terms  of  substitutions  per  site  in  the  marker  (i.e.  16S  rRNA)  gene 
and  weighted  by  the  frequency  of  that  organism  in  the  profile.  As  expected,  NSTI  values  were 
greatest  for  the  phylogenetically  diverse  hypersaline  mat  microbiome  (mean  NSTI=0.23  +/-  0.07 
s.d.),  least  for  the  well-covered  HMP  samples  (mean  NSTI=0.03  +/-  0.02  s.d.),  mid-range  for  the 
soils  (mean  NSTI=0.17  +/-  0.02  s.d.)  and  varied  for  the  mammals  (mean  NSTI=0.14  +/-  0.06 
s.d.).  Also  as  expected,  the  accuracy  of  PICRUSt  in  general  decreased  with  increasing  NSTI 
across  all  samples  (Spearman  r=-0.4,  p<  0.001)  and  within  each  microbiome  type  (Spearman  r=- 
0.25  to  -0.82,  p<0.05).  For  a  subset  of  mammal  gut  samples  (NSTI<0.05)  and  all  of  the  soil 
samples  that  we  tested,  PICRUSt  produced  accurate  metagenome  predictions  (Spearman  r=0.72 
and  0.81,  respectively,  both  p<0.001).  Both  the  mammal  and  hypersaline  metagenomes  were 
shallowly  sequenced  at  a  depth  expected  to  be  insufficient  to  fully  sample  the  underlying 
community’s  genomic  composition,  thus  likely  causing  the  accuracy  of  PICRUSt  to  appear 
artificially  lower  for  these  communities.  Although  the  lower  accuracy  on  the  hypersaline 


microbial  mats  community  (Spearman  r  =0.25,  p<0.001)  confirms  that  PICRUSt  must  be  applied 
with  eaution  to  the  most  novel  and  diverse  eommunities,  the  ability  to  ealeulate  NSTI  values 
within  PICRUSt  from  16S  data  allows  users  to  determine  whether  their  samples  are  traetable  for 
PICRUSt  predietion  prior  to  running  an  analysis.  Moreover,  the  evaluation  results  verily  that 
PICRUSt  provides  useful  funetional  predietions  for  a  broad  range  of  environments  beyond  the 
well-studied  human  mierobiome. 
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